Graphics Processing Units (GPUs) are commonly used for deep learning acceleration due to their massive parallel throughput compared to CPUs. LiteRT Next simplifies GPU acceleration by letting you specify the hardware accelerator as a parameter when creating a Compiled Model (CompiledModel). LiteRT Next also uses a new and improved GPU acceleration implementation that is not offered by LiteRT.
With LiteRT Next's GPU acceleration, you can create GPU-friendly input and output buffers, achieve zero-copy with your data in GPU memory, and execute tasks asynchronously to maximize parallelism.
For example implementations of LiteRT Next with GPU support, refer to the following demo applications:
Add GPU dependency
Use the following steps to add GPU dependency to your Kotlin or C++ application.
Kotlin
For Kotlin users, the GPU accelerator is built-in and does not require additional steps beyond the Get Started guide.
C++
For C++ users, you must build the dependencies of the application with LiteRT GPU acceleration. The cc_binary rule that packages the core application logic (e.g., main.cc) requires the following runtime components:
- LiteRT C API shared library: the data attribute must include the LiteRT C API shared library (//litert/c:litert_runtime_c_api_shared_lib) and GPU-specific components (@litert_gpu//:jni/arm64-v8a/libLiteRtGpuAccelerator.so).
- Attribute dependencies: the deps attribute typically includes GLES dependencies (gles_deps()), and linkopts typically includes gles_linkopts(). Both are highly relevant for GPU acceleration, since LiteRT often uses OpenGL ES on Android.
- Model files and other assets: included through the data attribute.
The following is an example of a cc_binary rule:
cc_binary(
    name = "your_application",
    srcs = [
        "main.cc",
    ],
    data = [
        ...
        # litert c api shared library
        "//litert/c:litert_runtime_c_api_shared_lib",
        # GPU accelerator shared library
        "@litert_gpu//:jni/arm64-v8a/libLiteRtGpuAccelerator.so",
    ],
    linkopts = select({
        "@org_tensorflow//tensorflow:android": ["-landroid"],
        "//conditions:default": [],
    }) + gles_linkopts(), # gles link options
    deps = [
        ...
        "//litert/cc:litert_tensor_buffer", # litert cc library
        ...
    ] + gles_deps(), # gles dependencies
)
This setup allows your compiled binary to dynamically load and use the GPU for accelerated machine learning inference.
Get started
To get started using the GPU accelerator, pass the GPU parameter when creating the Compiled Model (CompiledModel). The following code snippet shows a basic implementation of the entire process:
C++
// 1. Load model
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("mymodel.tflite"));
// 2. Create a compiled model targeting GPU
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model, CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));
// 3. Prepare input/output buffers
LITERT_ASSIGN_OR_RETURN(auto input_buffers, compiled_model.CreateInputBuffers());
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());
// 4. Fill input data (if you have CPU-based data)
input_buffers[0].Write<float>(absl::MakeConstSpan(cpu_data, data_size));
// 5. Execute
compiled_model.Run(input_buffers, output_buffers);
// 6. Access model output
std::vector<float> data(output_data_size);
output_buffers[0].Read<float>(absl::MakeSpan(data));
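If GPU compilation is not available on a particular device, one reasonable pattern is to fall back to the CPU accelerator. The following sketch reuses the calls shown above; the kLiteRtHwAcceleratorCpu constant and the HasValue() check on the returned value are assumptions rather than confirmed API details, so verify them against the LiteRT C++ API reference:
// Sketch (not an official recommended pattern): fall back to the CPU
// accelerator if creating a GPU-compiled model fails on this device.
// Assumption: CompiledModel::Create returns an Expected-style value (as implied
// by the LITERT_ASSIGN_OR_RETURN macro above) that can be tested with HasValue().
auto compiled_model = CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu);
if (!compiled_model.HasValue()) {
  // Assumption: kLiteRtHwAcceleratorCpu is the CPU counterpart of
  // kLiteRtHwAcceleratorGpu in the LiteRT C API.
  compiled_model = CompiledModel::Create(env, model, kLiteRtHwAcceleratorCpu);
}
After this check, unwrap the value (for example with LITERT_ASSIGN_OR_RETURN) and continue with buffer creation and execution as in steps 3 through 6 above.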
Kotlin
// Load model and initialize runtime
val model =
    CompiledModel.create(
        context.assets,
        "mymodel.tflite",
        CompiledModel.Options(Accelerator.GPU),
        env,
    )
// Preallocate input/output buffers
val inputBuffers = model.createInputBuffers()
val outputBuffers = model.createOutputBuffers()
// Fill the first input
inputBuffers[0].writeFloat(FloatArray(data_size) { data_value /* your data */ })
// Invoke
model.run(inputBuffers, outputBuffers)
// Read the output
val outputFloatArray = outputBuffers[0].readFloat()
For more information, see the Get Started with C++ or Get Started with Kotlin guides.
LiteRT Next GPU Accelerator
The new GPU Accelerator, available only with LiteRT Next, is optimized to handle AI workloads, like large matrix multiplications and KV cache for LLMs, more efficiently than previous versions. The LiteRT Next GPU Accelerator features the following key improvements over the LiteRT version:
- Extended operator coverage: Handles larger, more complex neural networks.
- Better buffer interoperability: Enables direct use of GPU buffers for camera frames, 2D textures, or large LLM states.
- Asynchronous execution support: Overlaps CPU pre-processing with GPU inference.
Zero-copy with GPU acceleration
Using zero-copy enables a GPU to access data directly in its own memory without the need for the CPU to explicitly copy that data. By not copying data to and from CPU memory, zero-copy can significantly reduce end-to-end latency.
The following code is an example implementation of zero-copy GPU acceleration with OpenGL, an API for rendering 2D and 3D graphics. The code passes images in the OpenGL buffer format directly to LiteRT Next:
// Suppose you have an OpenGL buffer consisting of:
// target (GLenum), id (GLuint), size_bytes (size_t), and offset (size_t)
// Load model and compile for GPU
LITERT_ASSIGN_OR_RETURN(auto model, Model::CreateFromFile("mymodel.tflite"));
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
                        CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));
// Create a TensorBuffer that wraps the OpenGL buffer.
LITERT_ASSIGN_OR_RETURN(auto tensor_type, model.GetInputTensorType("input_tensor_name"));
LITERT_ASSIGN_OR_RETURN(auto gl_input_buffer,
                        TensorBuffer::CreateFromGlBuffer(env, tensor_type, opengl_buffer.target,
                                                         opengl_buffer.id, opengl_buffer.size_bytes,
                                                         opengl_buffer.offset));
std::vector<TensorBuffer> input_buffers{gl_input_buffer};
LITERT_ASSIGN_OR_RETURN(auto output_buffers, compiled_model.CreateOutputBuffers());
// Execute
compiled_model.Run(input_buffers, output_buffers);
// If your output is also GPU-backed, you can fetch an OpenCL buffer or re-wrap it as an OpenGL buffer:
LITERT_ASSIGN_OR_RETURN(auto out_cl_buffer, output_buffers[0].GetOpenClBuffer());
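If the output is consumed on the CPU instead, it can be read back into host memory with the same Read call used in the basic example above. This short follow-up is a sketch that reuses only calls already shown in this guide:
// Otherwise, copy the result back to CPU memory, as in the basic example.
std::vector<float> data(output_data_size);
output_buffers[0].Read<float>(absl::MakeSpan(data));
Reading into CPU memory gives up the zero-copy benefit for the output path, so it is mainly useful when the result is small or needs CPU post-processing.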
Asynchronous execution
LiteRT's asynchronous methods, like RunAsync(), let you schedule GPU inference while continuing other tasks using the CPU or the NPU. In complex pipelines, the GPU is often used asynchronously alongside CPUs or NPUs.

The following code snippet builds on the code provided in the Zero-copy GPU acceleration example. The code uses both the CPU and the GPU asynchronously and attaches a LiteRT Event to the input buffer. The LiteRT Event is responsible for managing different types of synchronization primitives, and the following code creates a managed LiteRT Event object of type LiteRtEventTypeEglSyncFence. This Event object ensures that we don't read from the input buffer until the GPU is done. All this is done without involving the CPU.
LITERT_ASSIGN_OR_RETURN(auto env, Environment::Create({}));
LITERT_ASSIGN_OR_RETURN(auto compiled_model,
                        CompiledModel::Create(env, model, kLiteRtHwAcceleratorGpu));
// 1. Prepare input buffer (OpenGL buffer)
LITERT_ASSIGN_OR_RETURN(auto gl_input,
                        TensorBuffer::CreateFromGlBuffer(env, tensor_type, opengl_tex));
std::vector<TensorBuffer> inputs{gl_input};
LITERT_ASSIGN_OR_RETURN(auto outputs, compiled_model.CreateOutputBuffers());
// 2. If the GL buffer is in use, create and set an event object to synchronize with the GPU.
LITERT_ASSIGN_OR_RETURN(auto input_event,
                        Event::CreateManagedEvent(env, LiteRtEventTypeEglSyncFence));
inputs[0].SetEvent(std::move(input_event));
// 3. Kick off the GPU inference
compiled_model.RunAsync(inputs, outputs);
// 4. Meanwhile, do other CPU work...
// CPU Stays busy ..
// 5. Access model output
std::vector<float> data(output_data_size);
outputs[0].Read<float>(absl::MakeSpan(data));
Supported models
LiteRT Next supports GPU acceleration with the following models. Benchmark results are based on tests run on a Samsung Galaxy S24 device.
Model | GPU delegation | GPU inference time (ms) |
---|---|---|
hf_mms_300m | Fully delegated | 19.6 |
hf_mobilevit_small | Fully delegated | 8.7 |
hf_mobilevit_small_e2e | Fully delegated | 8.0 |
hf_wav2vec2_base_960h | Fully delegated | 9.1 |
hf_wav2vec2_base_960h_dynamic | Fully delegated | 9.8 |
isnet | Fully delegated | 43.1 |
timm_efficientnet | Fully delegated | 3.7 |
timm_nfnet | Fully delegated | 9.7 |
timm_regnety_120 | Fully delegated | 12.1 |
torchaudio_deepspeech | Fully delegated | 4.6 |
torchaudio_wav2letter | Fully delegated | 4.8 |
torchvision_alexnet | Fully delegated | 3.3 |
torchvision_deeplabv3_mobilenet_v3_large | Fully delegated | 5.7 |
torchvision_deeplabv3_resnet101 | Fully delegated | 35.1 |
torchvision_deeplabv3_resnet50 | Fully delegated | 24.5 |
torchvision_densenet121 | Fully delegated | 13.9 |
torchvision_efficientnet_b0 | Fully delegated | 3.6 |
torchvision_efficientnet_b1 | Fully delegated | 4.7 |
torchvision_efficientnet_b2 | Fully delegated | 5.0 |
torchvision_efficientnet_b3 | Fully delegated | 6.1 |
torchvision_efficientnet_b4 | Fully delegated | 7.6 |
torchvision_efficientnet_b5 | Fully delegated | 8.6 |
torchvision_efficientnet_b6 | Fully delegated | 11.2 |
torchvision_efficientnet_b7 | Fully delegated | 14.7 |
torchvision_fcn_resnet50 | Fully delegated | 19.9 |
torchvision_googlenet | Fully delegated | 3.9 |
torchvision_inception_v3 | Fully delegated | 8.6 |
torchvision_lraspp_mobilenet_v3_large | Fully delegated | 3.3 |
torchvision_mnasnet0_5 | Fully delegated | 2.4 |
torchvision_mobilenet_v2 | Fully delegated | 2.8 |
torchvision_mobilenet_v3_large | Fully delegated | 2.8 |
torchvision_mobilenet_v3_small | Fully delegated | 2.3 |
torchvision_resnet152 | Fully delegated | 15.0 |
torchvision_resnet18 | Fully delegated | 4.3 |
torchvision_resnet50 | Fully delegated | 6.9 |
torchvision_squeezenet1_0 | Fully delegated | 2.9 |
torchvision_squeezenet1_1 | Fully delegated | 2.5 |
torchvision_vgg16 | Fully delegated | 13.4 |
torchvision_wide_resnet101_2 | Fully delegated | 25.0 |
torchvision_wide_resnet50_2 | Fully delegated | 13.4 |
u2net_full | Fully delegated | 98.3 |
u2net_lite | Fully delegated | 51.4 |
hf_distil_whisper_small_no_cache | Partially delegated | 251.9 |
hf_distilbert | Partially delegated | 13.7 |
hf_tinyroberta_squad2 | Partially delegated | 17.1 |
hf_tinyroberta_squad2_dynamic_batch | Partially delegated | 52.1 |
snapml_StyleTransferNet | Partially delegated | 40.9 |
timm_efficientformer_l1 | Partially delegated | 17.6 |
timm_efficientformerv2_s0 | Partially delegated | 16.1 |
timm_pvt_v2_b1 | Partially delegated | 73.5 |
timm_pvt_v2_b3 | Partially delegated | 246.7 |
timm_resnest14d | Partially delegated | 88.9 |
torchaudio_conformer | Partially delegated | 21.5 |
torchvision_convnext_tiny | Partially delegated | 8.2 |
torchvision_maxvit_t | Partially delegated | 194.0 |
torchvision_shufflenet_v2 | Partially delegated | 9.5 |
torchvision_swin_tiny | Partially delegated | 164.4 |
torchvision_video_resnet2plus1d_18 | Partially delegated | 6832.0 |
torchvision_video_swin3d_tiny | Partially delegated | 2617.8 |
yolox_tiny | Partially delegated | 11.2 |